A Framework for Clustering Mixed Attribute Type Datasets

Authors

  • Jongwoo Lim
  • Jongeun Jun
  • Seon Ho Kim
  • Dennis McLeod
Abstract

We propose a clustering framework for datasets with mixed attribute types (numerical and categorical) that minimizes information loss during clustering. Real-world datasets, such as medical datasets and their ontologies, often contain mixed attribute types. However, most conventional clustering algorithms have been designed for, and applied to, datasets containing a single attribute type (either numerical or categorical). Approaches to clustering mixed attribute type datasets have recently emerged, but they mainly transform attributes so that conventional algorithms can be applied directly. Such approaches risk distorted results because a significant portion of the attribute values can be lost in the transformation, which lowers clustering accuracy. To address this problem, we propose a clustering framework for mixed attribute type datasets that does not transform attributes. First, we use an entropy-based measure of categorical attributes as our criterion function for similarity. Second, based on the results of the entropy-based similarity, we extract candidate cluster numbers and verify our weighting scheme against pre-clustering results. Finally, we cluster the mixed attribute type datasets using the extracted candidate cluster numbers and the weights. Our experimental results demonstrate that the proposed framework is effective in increasing accuracy.
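The abstract does not spell out the entropy-based measure, so the following is only a minimal sketch of one common choice: the size-weighted (expected) Shannon entropy of the categorical attributes within each cluster, where lower values indicate more homogeneous clusters. The function names (`attribute_entropy`, `cluster_entropy`, `expected_entropy`) and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of the values of one categorical attribute."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def cluster_entropy(rows):
    """Sum of per-attribute entropies over the rows of a single cluster."""
    if not rows:
        return 0.0
    # zip(*rows) turns a list of rows into a sequence of attribute columns.
    return sum(attribute_entropy(column) for column in zip(*rows))


def expected_entropy(rows, labels):
    """Size-weighted average of cluster entropies (an assumed criterion)."""
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    n = len(rows)
    return sum(len(members) / n * cluster_entropy(members)
               for members in clusters.values())


# Toy categorical dataset: two attributes, four records.
rows = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "large")]

good = expected_entropy(rows, [0, 0, 1, 1])  # identical records grouped together
bad = expected_entropy(rows, [0, 1, 0, 1])   # each cluster mixes both kinds
print(good, bad)
```

Under this assumed criterion, the partition that groups identical records scores lower than the one that mixes them, which is the behaviour a similarity criterion of this kind relies on when comparing candidate clusterings.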

Similar resources

A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset

Background: In this paper, a generic hesitant fuzzy set (HFS) model for clustering various ECG beats according to weights of attributes is proposed. A comprehensive review of electrocardiogram signal classification and segmentation methodologies indicates that algorithms able to effectively handle the nonstationarity and uncertainty of the signals should be used for ECG analysis. Ex...

Affinity Learning for Mixed Data Clustering

In this paper, we propose a novel affinity learning based framework for mixed data clustering, which includes: how to process data with mixed-type attributes, how to learn affinities between data points, and how to exploit the learned affinities for clustering. In the proposed framework, each original data attribute is represented with several abstract objects defined according to the specific ...

Clustering Large Data with Mixed Values Using Extended Fuzzy Adaptive Resonance Theory

Clustering is one of the techniques in content mining and is used for grouping similar items. Clustering software datasets with mixed values is a major challenge in clustering applications. Previous work deals with unsupervised feature learning techniques such as k-Means and C-Means, which cannot process mixed-type data. There are several drawbacks in the prev...

HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm

Clustering is an important data mining technique. There are many algorithms that cluster either numeric or categorical data. However, few algorithms cluster mixed-type datasets with both numerical and categorical attributes. In this paper, we propose a similarity measure between two clusters that enables hierarchical clustering of data with numerical and categorical attributes. This similarity m...

Clustering Mixed Numeric and Categorical Data: A Cluster Ensemble Approach

Clustering is a widely used technique in data mining applications for discovering patterns in underlying data. Most traditional clustering algorithms are limited to handling datasets that contain either numeric or categorical attributes. However, datasets with mixed types of attributes are common in real life data mining applications. In this paper, we propose a novel divide-and-conquer techniq...


Publication date: 2012